1 Page placement algorithms for large real-indexed caches
R. E. Kessler, Mark D. Hill

November 1992 ACM Transactions on Computer Systems (TOCS), volume 10 issue 4

When a computer system supports both paged virtual memory and large real-indexed 
caches, cache performance depends in part on the main memory page placement. To date, 
most operating systems place pages by selecting an arbitrary page frame from a pool of 
page frames that have been made available by the page replacement algorithm. We give a 
simple model that shows that this naive (arbitrary) page placement leads to up to 30% 
unnecessary cache conflicts. We develop several page placement algor ... 

2 UTLB: a mechanism for address translation on network interfaces

Yuqun Chen, Angelos Bilas, Stefanos N. Damianakis, Cezary Dubnicki, Kai Li

October 1998 Proceedings of the eighth international conference on Architectural
support for programming languages and operating systems, volume 33, 32
issue 11, 5

An important aspect of a high-speed network system is the ability to transfer data directly
between the network interface and application buffers. Such a direct data path requires the
network interface to "know" the virtual-to-physical address translation of a user buffer, i.e.,
the physical memory location of the buffer. This paper presents an efficient address
translation architecture, User-managed TLB (UTLB), which eliminates system calls and
device interrupts from the common co ... 



3 Effect of node size on the performance of cache-conscious B+-trees

Richard A. Hankins, Jignesh M. Patel

June 2003 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the
2003 ACM SIGMETRICS international conference on Measurement and
modeling of computer systems, volume 31 issue 1

In main-memory databases, the number of processor cache misses has a critical impact on 
the performance of the system. Cache-conscious indices are designed to improve 
performance by reducing the number of processor cache misses that are incurred during a 
search operation. Conventional wisdom suggests that the index's node size should be equal
to the cache line size in order to minimize the number of cache misses and improve 
performance. As we show in this paper, this design choice ignores additi ... 

Keywords: B+-tree, cache-conscious, index



4 Improving IPC by kernel design
Jochen Liedtke

December 1993 ACM SIGOPS Operating Systems Review, Proceedings of the
fourteenth ACM symposium on Operating systems principles, volume 27
issue 5

Inter-process communication (ipc) has to be fast and effective, otherwise programmers will
not use remote procedure calls (RPC), multithreading and multitasking adequately. Thus ipc
performance is vital for modern operating systems, especially μ-kernel based ones.
Surprisingly, most μ-kernels exhibit poor ipc performance, typically requiring 100
μs for a short message transfer on a modern processor, running with 50 MHz clock
rate. In contrast, we achieve 5 μs; a twenty ...

5 Characterizing the caching and synchronization performance of a multiprocessor
operating system

Josep Torrellas, Anoop Gupta, John Hennessy

September 1992 ACM SIGPLAN Notices, Proceedings of the fifth international
conference on Architectural support for programming languages and
operating systems, volume 27 issue 9



6 Compiler-based I/O prefetching for out-of-core applications

Angela Demke Brown, Todd C. Mowry, Orran Krieger 

May 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 2 

Current operating systems offer poor performance when a numeric application's working set
does not fit in main memory. As a result, programmers who wish to solve "out-of-core"
problems efficiently are typically faced with the onerous task of rewriting an application to
use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully
automatic technique which liberates the programmer from this task, provides high
performance, and requires only minima ...

Keywords: compiler optimization, prefetching, virtual memory 



7 Disco: running commodity operating systems on scalable multiprocessors

Edouard Bugnion, Scott Devine, Kinshuk Govil, Mendel Rosenblum

November 1997 ACM Transactions on Computer Systems (TOCS), volume 15 issue 4

In this article we examine the problem of extending modern operating systems to run 
efficiently on large-scale shared-memory multiprocessors without a large implementation 
effort. Our approach brings back an idea popular in the 1970s: virtual machine monitors.
We use virtual machines to run multiple commodity operating systems on a scalable
multiprocessor. This solution addresses many of the challenges facing the system software 
for these machines. We demonstrate our approach with a prototy ... 

Keywords: scalable multiprocessors, virtual machines 



8 CAMERA: introducing memory concepts via visualization
Linda Null, Karishma Rao 

February 2005 ACM SIGCSE Bulletin, Proceedings of the 36th SIGCSE technical
symposium on Computer science education, volume 37 issue 1

CAMERA, Cache and Memory Resource Allocation, is a collection of workbenches for cache 
mapping schemes (including direct, fully associative, and set associative) and virtual 
memory (including paging and TLBs). Its goals are to provide users with interactive 
tutorials and simulations to help them better understand the fundamental concepts of 
memory management. Implemented in Java Swing, these workbenches allow users to 
observe the processes of memory to cache mapping, and virtual memory us ... 

Keywords: computer memory workbenches, education, tutorial 



9 Automatic pool allocation: improving performance by controlling data structure layout in
the heap

Chris Lattner, Vikram Adve 

May 2005 ACM SIGPLAN Notices, Proceedings of the 2005 ACM SIGPLAN conference
on Programming language design and implementation, volume 40 issue 6

This paper describes Automatic Pool Allocation, a transformation framework that segregates 
distinct instances of heap-based data structures into separate memory pools and allows
heuristics to be used to partially control the internal layout of those data structures. The 
primary goal of this work is performance improvement, not automatic memory 
management, and the paper makes several new contributions. The key contribution is a 
new compiler algorithm for partitioning heap objects in impera ... 

Keywords: cache, data layout, pool allocation, recursive data structure, static analysis 



10 Application performance and flexibility on exokernel systems

M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Hector M. Briceno, Russell Hunt,
David Mazières, Thomas Pinckney, Robert Grimm, John Jannotti, Kenneth Mackenzie
October 1997 ACM SIGOPS Operating Systems Review, Proceedings of the sixteenth
ACM symposium on Operating systems principles, volume 31 issue 5


11 Automatic tiling of iterative stencil loops
Zhiyuan Li, Yonghong Song

November 2004 ACM Transactions on Programming Languages and Systems (TOPLAS),
volume 26 issue 6

Iterative stencil loops are used in scientific programs to implement relaxation methods for 
numerical simulation and signal processing. Such loops iteratively modify the same array 
elements over different time steps, which presents opportunities for the compiler to
improve the temporal data locality through loop tiling. This article presents a compiler
framework for automatic tiling of iterative stencil loops, with the objective of improving the
cache performance. The article first presents a ...

Keywords: Caches, loop transformations, optimizing compilers 



12 Towards a theory of cache-efficient algorithms
Sandeep Sen, Siddhartha Chatterjee, Neeraj Dumir 
November 2002 Journal of the ACM (JACM), Volume 49 issue 6 

We present a model that enables us to analyze the running time of an algorithm on a 
computer with a memory hierarchy with limited associativity, in terms of various cache 
parameters. Our cache model, an extension of Aggarwal and Vitter's I/O model, enables us 
to establish useful relationships between the cache complexity and the I/O complexity of 
computations. As a corollary, we obtain cache-efficient algorithms in the single-level cache 
model for fundamental problems like sorting, FFT, and an I ... 

Keywords: Hierarchical memory, I/O complexity, lower bound 




13 Binary translation and architecture convergence issues for IBM System/390

Michael Gschwind, Kemal Ebcioğlu, Erik Altman, Sumedh Sathaye

May 2000 Proceedings of the 14th international conference on Supercomputing 

We describe the design issues in an implementation of the ESA/390 architecture based on
binary translation to a very long instruction word (VLIW) processor. During binary
translation, complex ESA/390 instructions are decomposed into instruction "primitives"
which are then scheduled onto a wide-issue machine. The aim is to achieve high instruction 
level parallelism due to the increased scheduling and optimization opportunities which can 
be exploited by binary translation software ... 

14 System-level power optimization: techniques and tools
Luca Benini, Giovanni de Micheli

April 2000 ACM Transactions on Design Automation of Electronic Systems (TODAES),
volume 5 issue 2

This tutorial surveys design methods for energy-efficient system-level design. We consider 
electronic systems consisting of a hardware platform and software layers. We consider the
three major constituents of hardware that consume energy, namely computation,
communication, and storage units, and we review methods of reducing their energy
consumption. We also study models for analyzing the energy cost of software, and methods
for energy-efficient software design and compilation. This survey ...

15 Implementation aspects of a SPARC V9 complete machine simulator

Bill Clarke, Adam Czezowski, Peter Strazdins

January 2002 Australian Computer Science Communications, Proceedings of the
twenty-fifth Australasian conference on Computer science - Volume 4
CRPITS '02, volume 24 issue 1

In this paper we present work in progress in the development of a complete machine 
simulator for the UltraSPARC, an implementation of the SPARC V9 architecture. The
complexity of the UltraSPARC ISA presents many challenges in developing a reliable and yet
reasonably efficient implementation of such a simulator. Our implementation includes a
heavily object-oriented design for the simulator modules and infrastructure, caching of
repeated computations for performance, adding an OS (system call) emu ... 

Keywords: SMP, SPARC V9 ISA, UltraSPARC, complete machine simulator,
execution-driven simulation, object-oriented design



16 Accelerating shared virtual memory via general-purpose network interface support
Angelos Bilas, Dongming Jiang, Jaswinder Pal Singh

February 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 1

Clusters of symmetric multiprocessors (SMPs) are important platforms for high-performance 
computing. With the success of hardware cache-coherent distributed shared memory 
(DSM), a lot of effort has also been made to support the coherent shared-address-space 
programming model in software on clusters. Much research has been done in fast
communication on clusters and in protocols for supporting software shared memory across 
them. However, the performance of software virtual memory (SVM) is sti ... 

Keywords: applications, clusters, shared virtual memory, system area networks 



17 The Opie compiler from row-major source to Morton-ordered matrices
Steven T. Gabriel, David S. Wise 

June 2004 Proceedings of the 3rd workshop on Memory performance issues: in 
conjunction with the 31st international symposium on computer 
architecture 

The Opie Project aims to develop a compiler to transform C codes written for row-major 
matrix representation into equivalent codes for Morton-order matrix representation, and to 
apply its techniques to other languages. Accepting a possible reduction in performance we 
seek to compile a library of usable code to support future development of new algorithms 
better suited to Morton-ordered matrices. This paper reports the formalism behind the OPIE
compiler for C, its status: now compiling several sta ... 

Keywords: cache, paging, quadtrees, scientific computing 



18 Creating and preserving locality of java applications at allocation and garbage
collection times

Yefim Shuf, Manish Gupta, Hubertus Franke, Andrew Appel, Jaswinder Pal Singh
November 2002 ACM SIGPLAN Notices, Proceedings of the 17th ACM SIGPLAN
conference on Object-oriented programming, systems, languages, and
applications, volume 37 issue 11

The growing gap between processor and memory speeds is motivating the need for 
optimization strategies that improve data locality. A major challenge is to devise techniques 
suitable for pointer-intensive applications. This paper presents two techniques aimed at 
improving the memory behavior of pointer-intensive applications with dynamic memory 
allocation, such as those written in Java. First, we present an allocation time object 
placement technique based on the recently introduced notion of p ...

Keywords: JVM, Java, garbage collection, heap traversal, locality, locality based graph 
traversal, memory allocation, memory management, object co-allocation, object placement, 
prolific types, run-time systems 



19 Session I: transformation: Metrics and models for reordering transformations
Michelle Mills Strout, Paul D. Hovland

June 2004 Proceedings of the 2004 workshop on Memory system performance 

Irregular applications frequently exhibit poor performance on contemporary computer 
architectures, in large part because of their inefficient use of the memory hierarchy.
Run-time data- and iteration-reordering transformations have been shown to improve the
locality and therefore the performance of irregular benchmarks. This paper describes 
models for determining which combination of run-time data- and iteration-reordering 
heuristics will result in the best performance for a given dataset. We pr ... 

Keywords: data locality, inspector/executor, locality metrics, optimization, run-time 
reordering transformations, spatial locality graph, temporal locality hypergraph 



20 Caching & file systems: Fast data-locality profiling of native execution
Erik Berg, Erik Hagersten

June 2005 Proceedings of the 2005 ACM SIGMETRICS international conference on 
Measurement and modeling of computer systems 

Performance tools based on hardware counters can efficiently profile the cache behavior of 
an application and help software developers improve its cache utilization. Simulator-based 
tools can potentially provide more insights and flexibility and model many different cache 
configurations, but have the drawback of large run-time overhead. We present StatCache, a 
performance tool based on a statistical cache model. It has a small run-time overhead while 
providing much of the flexibility of simulator ... 

Keywords: cache behavior, profiling tool 